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Abstract 

Learning network structure underlying data is an 
important problem in machine learning. This pa¬ 
per presents a novel degree prior to study the in¬ 
ference of scale-free networks, which are widely 
used to model social and biological networks. In 
particular, this paper formulates scale-free net¬ 
work inference using Gaussian Graphical model 
(GGM) regularized by a node degree prior. Our 
degree prior not only promotes a desirable global 
degree distribution, but also exploits the esti¬ 
mated degree of an individual node and the rela¬ 
tive strength of all the edges of a single node. To 
fulfill this, this paper proposes a ranking-based 
method to dynamically estimate the degree of 
a node, which makes the resultant optimization 
problem challenging to solve. To deal with this, 
this paper presents a novel ADMM (alternating 
direction method of multipliers) procedure. Our 
experimental results on both synthetic and real 
data show that our prior not only yields a scale- 
free network, but also produces many more cor¬ 
rectly predicted edges than existing scale-free in¬ 
ducing prior, hub-inducing prior and the l\ norm. 


1. Introduction 

Graphical models are widely used to describe the rela¬ 
tionship between variables (or features). Estimating the 
structure of an undirected graphical model from a dataset 
has been extensively studied (Meinshausen & Buhlmann, 
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2006; Yuan & Lin, 2007; Friedman et ah, 2008; Banerjee 
et ah, 2008; Wainwright et ah, 2006). Gaussian Graph¬ 
ical Models (GGMs) are widely used to model the data 
and 1 1 penalty is used to yield a sparse graph structure. 
GGMs assume that the observed data follows a multivari¬ 
ate Gaussian distribution with a covariance matrix. When 
GGMs are used, the network structure is encoded in the 
zero pattern of the precision matrix. Accordingly, struc¬ 
ture learning can be formulated as minimizing the nega¬ 
tive log-likelihood (NLL) with an 1 1 penalty. However, the 
widely-used l-\ penalty assumes that any pair of variables 
is equally likely to form an edge. This is not suitable for 
many real-world networks such as gene networks, protein- 
protein interaction networks and social networks, which are 
scale-free and contain a small percentage of hub nodes. 

A scale-free network (BARABASI & Bonabeau, 2003) has 
node degree following the power-law distribution. In par¬ 
ticular, a scale-free network may contain a few hub nodes, 
whose degrees are much larger than the others. That is, a 
hub node more likely forms an edge than the others. In 
real-world applications, a hub node is usually functionally 
important. For example, in a gene network, a gene playing 
functions in many biological processes (Zhang & Horvath, 
2005; Goh et ah, 2007) tends to be a hub. A few methods 
have been proposed to infer scale-free networks by using a 
reweighed 1\ norm. For example, (Peng et ah, 2009) pro¬ 
posed a joint regression sparse method to learn scale-free 
networks by setting the penalty proportional to the esti¬ 
mated degrees. (Candes et ah, 2008) proposed a method 
that iteratively reweighs the penalty in the l\ norm based 
on the inverse of the previous estimation of precision ma¬ 
trix. This method can suppress the large bias in the 1 i norm 
when the magnitude of the non-zero entries vary a lot and 
is also closer to the Iq norm. Some recent papers follow the 
idea of node-based learning using group lasso (Friedman 
et ah, 2010b; Meier et ah, 2008; Tan et ah, 2014; Fried¬ 
man et ah, 2010a; Mohan et ah, 2014). Since group lasso 
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penalty promotes similar patterns among variables in the 
same group, it tends to strengthen the signal of hub nodes 
and suppress that of non-hub nodes. As such, group lasso 
may only produce edges adjacent to nodes with a large de¬ 
gree (i.e., hubs). 

In order to capture the scale-free property, (Liu & Ihler, 

2011 ) minimizes the negative log-likelihood with a scale- 
free prior, which approximates the degree of each variable 
by the li norm. However, the objective function is non- 
convex and thus, is hard to optimize. (Defazio & Caetano, 

2012 ) approximates the global node degree distribution by 
the submodular convex envelope of the node degree func¬ 
tion. The convex envelope is the Lovasz extension (Lovasz, 
1983; Bach, 2010) of the sum of logarithm of node degree. 
These methods consider only the global node degree distri¬ 
bution, but not the degree of an individual node. However, 
the approximation function used to model this global de¬ 
gree distribution may not induce a power-law distribution, 
because there are many possible distributions other than 
power-law that optimizing the approximation function. See 
Theorem 1 for details. 

To further improve scale-free network inference, this pa¬ 
per introduces a novel node-degree prior, which not only 
promotes a desirable global node degree distribution, but 
also exploits the estimated degree of an individual node and 
the relative strengths of all the edges adjacent to the same 
node. To fulfill this, we use node and edge ranking to dy¬ 
namically estimate the degree of an individual node and the 
relative strength of one potential edge. As dynamic ranking 
is used in the prior, the objective function is very challeng¬ 
ing to optimize. This paper presents a novel ADMM (alter¬ 
nating direction method of multipliers) (Boyd et ah, 2011; 
Fukushima, 1992) method for this. 

2. Model 

2.1. Notation & Preliminaries 

Let G = (V. E) denote a graph, where V is the vertex 
set (p = 1^1) and E the edge set. We also use E to de¬ 
note the adjacency matrix of G, i.e., E,j = 1 if and only 
if (*, j) £ E. In this paper we assume that the variable 
set V has a Gaussian distribution with a precision matrix 
X. Then, the graph structure is encoded in the zero pat¬ 
tern of the estimated X, i.e. the edge formed by X is 
E(X) = {(u,v),X uv ^ 0}. Let F(X) denote the objec¬ 
tive function. ForGGMs, E(X) = tr(XS ) — logdet(X), 
where S is the empirical covariance matrix. To induce a 
sparse graph, a l\ penalty is applied to obtain the following 
objective function. 

X = argminFpf) + A||X||i (1) 

X 

Here we define two ranking functions. For a given i , 


let (.) denote the permutation of ( 1 , 2 , ...,p — 1 ) such 
that |X ii(1) | > |X ii(2) | > ... > l-X'i.Cp-i)! where 
(X* m, ^Q,( 2 ); X l ( p _ 1 \) is a shuffle of a row vector X, 
excluding the element Xi t j. Let M be a p — 1 dimension 
positive vector and V a p dimension vector. We define 
Xi O M = YZZ i |*i,(„)| M u . Let [X,M,V] be a per¬ 
mutation of ( 1 , 2 ,... ,p ) such that Xn ,x,m,v] ° M + V i > 
X [2 ,x,m,v] ° M + V 2 > ... > X\p X ,M,v] ° M + V p , where 
[j, X, M, V] denote the j th element of [X, M, V}. When 
V is a zero vector, we also write [X, M , V\ as [A', M) and 
[j, X, M, V} as [j, X, M], respectively. 

To promote a sparse graph, a natural way is to use a l\ 
penalty to regulate F(X). However, Z| norm cannot in¬ 
duce a scale-free network. This is because that a scale-free 
network has node degree following a power-law distribu¬ 
tion, i.e., P(d ) oc d -7 where d is the degree and 7 is a 
constant ranging from 2 to 3, rather than a uniform distri¬ 
bution. A simple prior for scale-free networks is the sum 
of the logarithm of node degree as follows. 

p 

A^7l°g(fi,; + !), (2) 

'll— 1 

where d v = J^u=i I[X V u / 0 ,# / «] and / is the indi¬ 
cator function. The constant 1 is added to handle the situa¬ 
tion when d v = 0. The l 0 norm in (2) is non-differentiable 
and thus, hard to optimize. One way to resolve this prob¬ 
lem is to approximate d v by the l\ norm of X v as shown 
in (Liu & Ihler, 2011). Since the logarithm of U approx¬ 
imation is non-convex, a convex envelope of ( 2 ) based on 
submodular function and Lovasz extension (Lovasz, 1983) 
is introduced recently in the papers (Bach, 2010; Defazio 
& Caetano, 2012) as follows. 

pp -i 

EE |^,(u)|(log(w+ 1) - log(u)J. (3) 

V=1 U= 1 

Let h(u) = (log(rt + 1) — log(rt)). The convex envelope 
of Eq. 2 can be written as 

p p -1 

53Em«)|x„ >(u) i. (4) 

V — l u= 1 

2.2. Node-Specific Degree Prior 

Although Eq. 2 and its approximations can be used to in¬ 
duce a scale-free graph, they still have some issues, as ex¬ 
plained in Theorem 1 . 

Theorem 1. Let S(G ) denote the sum of logarithm of node 
degree of a graph G. For any graph G satisfying the scale- 
free property, there exists another graph G' with the same 
number of edges but not satisfying the scale-free property 
such that S(G') < S(G). 
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Since we would like to use (2) as a regularizer, a smaller 
value of (2), denoted as S(G), is always favorable. Here 
we briefly prove Theorem 1. Since G is scale-free, a large 
percentage of its nodes have small degrees. It is reasonable 
to assume that G has two nodes v! and v' with the same 
degree. Let a denote their degree. Without loss of general¬ 
ity, there exists a node x connecting to u' but not to v'. We 
construct a graph Gr [1 from G by connecting x to v' and 
disconnecting x from u'. Since log is a concave function, 
we have 


S(G W ) = S(G) - 2 log (a + 1) + log(o) + log(a + 2) 
< S(G) (5) 

If there are two nodes with the same degree in G^\ we 
can construct G^ +1 ^ from such that S{G { ' +1 " > ) < 
S(G«) < .... < S(G^) < S(G). By this procedure, 
we may finally obtain a non-scale-free graph with a smaller 
S(G) value. So, Theorem 1 is proved. 

Since Theorem 1 shows that a global description of scale- 
free property is not the ideal choice for inducing a scale- 
free graph, we propose a new prior to describe the degree 
of each individual node in the target graph. 

Intuitively, given a node it, if we rank all the potential edges 
adjacent to u in descending order by \X U>V | where v £ V — 
it, then (it, v) with a higher rank is more likely to be a true 
edge than those with a lower rank and should receive less 
penalty. To account for this intuition, we introduce a non¬ 
decreasing positive sequence H and the following prior for 
node it 

p -1 

HoX u = Y j H v \X uM \. ( 6 ) 

V—l 

Here (v) represents the node ranked in the v th position. 
When all the elements in H are same, this prior simply is l\ 
norm. In our work, we use a moderately increasing positive 
sequence H (e.g. H v = log(u + 1)“ fori; = 1, 2, 1), 

such that a pair (it, zt) with a larger estimated has 

a smaller penalty. On the other hand, we do not want to 
penalize all the edges very differently since the penalty is 
based upon only rough estimation of the edge strength. 

Let d u denote the degree of node it, we propose the follow¬ 
ing prior based on Eq. 6 . 


E 


HoX u 

H du 


(7) 


Comparing to the l\ norm, our prior in Eq.7 is non-uniform 
since each edge has a different penalty, depending on its 
ranking and the estimated node degrees. This prior has the 
following two properties. 


• When v < d u , the penalty for X u j vj is less than or 
equal to 1 ; otherwise, the penalty is larger than 1 . 


• If node i has a higher degree than node j, then X l(vj 
has a smaller penalty than Xjj u y 


The first property implies that Eq.(7) favors a graph fol¬ 
lowing a given degree distribution. Suppose that in the true 
graph, node (or variable) i (i = 1 , 2 ,has degree di. 
Let E and denote two edge sets of the same size. 
If E W has the degree distribution (di, d 2 ,..., d p ) but E^ 
does not, then we can prove the following result (see Sup¬ 
plementary for proof). 


E 


H o E ( u ] 

H du 


^E 


H o E {2) 

H du 


( 8 ) 


The second property implies that Eq.(7) is actually ro¬ 
bust. Even if we cannot exactly estimate the node degree 
(di,d 2 ) •••) d p ), Eq.(7) may still work well as long as we 
can accurately rank nodes by their degrees. 


2.3. Dynamic Node-Specific Degree Prior 


We have introduced a prior (7) that favors a graph with a 
given degree distribution. Now the question is how to use 
(7) to produce a scale-free graph without knowing the exact 
node degree. 


Let denote the expected number of nodes with degree 
k. We can estimate from the prior degree distribution 
Po(d) oc dr 1 when 7 is given. Supposing that all the nodes 
are ranked in descending order by their degree, we use r t to 
denote the estimated degree of the i th ranked node based on 
the power-law distribution, such that t\ > t 2 > ■ ■ ■ > t p . 
Further, if we know the desired number of predicted edges, 
say we are told to output a edges, then the expected degree 
tL of node v is assumed to be proportional to a x r . 

In the following content, we would just use t v to denote the 
expected degree of node v. 


Now the question is how to rank nodes by their degrees? 
Although the exact degree d v of node v is not available, 
we can approximate log(d„ + 1) by Lovasz Extenion (De- 
fazio & Caetano, 2012; Bach, 2010; Lovasz, 1983), i.e., 
\ X v,{u)\ (log(u + 1) - log(u)) (see Eq. 2). That 
is, we can rank nodes by their Lovasz Extension. Note that 
although we use Lovasz Extension in our implementation, 
other approximations such as (1 norm can also be used to 
rank nodes without losing accuracy. 


We define our dynamic node-specific prior as follows. 


n(x) = 

v—l 


X[v,x,h] 0 H 

Ur„ 


(9) 


Note that h(u) = log(u + 1) — log(it) for 1 < u < p and 

[A', h\ defines the ranking based on Lovasz Extension. The 
i-th ranked node is assigned a node penalty // l , denoted 
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as g(i). Note that the ranking of nodes by their degrees is 
not static. Instead, it is determined dynamically by our op¬ 
timization algorithm. Whenever there is a new estimation 
of X, the node and edge ranking may be changed. That is, 
our node-specific degree prior is dynamic instead of static. 

3. Optimization 

With prior defined in (9), we have the following objective 
function: 

F(x)+/3n(X) (10) 

Here /3 is used to control sparsity level. The challenge of 
minimizing Eq.(10) lies in the fact that we do not know 
the ranking of both nodes and edges in advance. Instead 
we have to determine the ranking dynamically. We use an 
ADMM algorithm to solve it by introducing dual variables 
Y and U as follows. 

X l+1 = argmin F(X) + %\\X-Y l + U l f F (11) 
Y l+1 = arg min /3fl(Y) + ^\\X l+1 -Y + U l \\ 2 F (12) 
U l+1 =U l + X l+1 - Y l+1 (13) 

We can use a first-order method such as gradient descent 
to optimize the first subproblem since F(X) is convex. In 
this paper, we assume that F(x) is a Gaussian likelihood, 
in which case ( 11 ) can be solved by eigen-decomposition. 
Here we describe a novel algorithm for (12). 

Let A = X l+1 + U l and A = Minimizing (12) is equiv¬ 
alent to 

+ ( 14 ) 

which can be relaxed as 

min7, 11 ^ - a \\f + ° H 

~ V — l 

s.t. g([v]) = g([v,Y,h]) 1 < v < p. (15) 

Here, we simply use [•] to denote a permutation of 
{ 1 , 2 , ...,p — 1 }, and use [u] to denote the v th element 
of this permutation. The reason we introduce (15) is, 
given Y and without the constraint of (15), the optimal 
[.] is [y, H]. Adding the constraint g([u]) = </([u, Y, h]), 
(15) can be relaxed by solving the following problem until 
g([v,Y,H,S]) = g([v,Y,h\) 1 <v<p. 

1 p 

min-||y - A\\ 2 f + 9( v ){ y [ v ,y,h,s] ° H } 

’ U = 1 

( 16 ) 


Here S is the dual vector and can be updated by S(v ) = 
g ■ (g([v, Y, 7T, 5]) — g([v,Y,h]) for 1 < v < p, where 
g is the step size. Actually, as only a small percentage of 
nodes have large degrees, we may speed up by using the 
condition g([v, Y, IT, 5]) = g([v,Y,h]) for 1 < v < k 
where k is much smaller than p. That is, instead of ranking 
all the nodes, we just need to select top k of them, which 
can be done much faster. We now propose Algorithm 1 to 
solve (16), which in spirit is a dual decomposition (Sontag 
et al., 2011 ) algorithm. 


Algorithm 1 Update of node ranking 

1 

Randomly generate Y°. Set t = 

0 , <5° 

= 0 and com- 


pute [y°, H , 5°) 



2 

while TRUE do 



3 

Y t+1 =argmin y |||y-A||| 



4 


H 


5 

if \Y t+1 ,H,5 t ] = [Y*, H, = 

- [Y t+l , h] then 

6 

break 



7 

else 



8 

^ t+1 = 6* + g(g([Y t+1 , H, 5 4 ]) - 

g([Y t+ \h})) 

9 

end if 



10 

t = t + l 



11 

end while 




Theorem 2. The output Y' of algorithm 1 satisfies the fol¬ 
lowing condition. 

1 P 

Y' = argmin{-||y - A\\ 2 F + 9(v)Y[ v , Y ',h] ° H } 

~ U = 1 

(17) 


See supplementary for the proof of Theorem 2. 

Solving line 3-4 in Algorithm 1 is not trivial. We can re¬ 
formulate it as follows. 

p i 

mi n ^{-||y„ - A V \\ 2 F + Y v o H'{v)}. (18) 

11 = 1 

Here Hfv) is a p — 1 dimension vector and H' u (v) = 
\g(v)H u for 1 < u < p — 1. Since Y is symmetric, (18) 
can be reformulated as follows. 

p i 

imn^2{-\\Y v ~ A v \\ 2 f +Y v oH'(v)} (19) 

11 = 1 

s.t. Y = Y t . 

Problem (19) can be solved iteratively using dual decompo¬ 
sition by introducing the Lagrangian term tr(a(Y — y T )), 
where o is a p by p matrix which would be updated by 
a = a + g(Y — Y T ). Notice that tr{a(Y — Y T )) = 
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tr((cr — a T )Y ), (19) can be decomposed into p indepen¬ 
dent sub-problems as follows. 

min 1\\Y V - (A + a T - <j) v \\ 2 f + Y v o H' (v) (20) 


Let B = A + a T — <t. Obviously Y v v = B VtV holds, 
so we do not consider Y v v in the remaining section. Let 
Y v ={Y Vi ^, Lu,( 2 )i Yv,(p-i)} be a feasible solution of 

(20). We define the cluster structure of Y v as follows. 

Definition 3 Let {L),,(i), •••, ^.(p-i)} be a ranked 

feasible solution. Supposing that 1 ^,( 1 ) | = |Yy,( 2 )| = 

... |k,(t)l > |, we say {Ly,(i)?k,( 2)3 

form cluster 1 and denote it as C\ y (1). Similarly, we 
can define C\y (2), C\y (3) and so on. Assume that 
{y„ i( i), 1 ^,( 2 ), Y v ,( p - 1 )} is clustered into T(Y V ) groups. 
Let \C\-p (fc)| denote the size of C\y (k) and y\y (k) the 
absolute value of its element. 

Assuming that Y* is the optimal solution of (20), by Defini¬ 
tion 3, for 1 < k < T(Y*), Y* has the following property. 


y\Y*(k) = max{ 0 , 


SieClyj (fe)i \Bv,(i) I 

|c| n *(fc)| 


- } 


( 21 ) 

See (Defazio & Caetano, 2012) and our supplementary ma¬ 
terial for detailed proof of (21). Based on (21), we propose 
a novel dynamic programming algorithm to solve ( 20 ), 
which can be reduced to the problem of finding the con¬ 


strained optimal partition of {£?„ (!), B v ( 2 ),..., B v ^ p _i)}. 


Let Y* 1:t = {C\ Y * ut {l),C\ Y » i t (2), ..., C\ Y * it (m = 
T ( Y * 1:t ))} be the optimal partition for 
}• Let f 1) = 

{|^,(t+i)l - H’ t+ i(v)} and C|y* 1 ;t (0) = {oo}. Then we 
have the following theorem. 


Theorem 3. Y* vt+1 = {C\ Y:i Jl), ...,C\ Yv%t (k - 
l),Cfc}, whereCk is a set with Y™+^\C\ y * ^ t (s)| elements 
which are equal to yk. 


y k = max{ 0 , 


£i, 


6 U”fc 0|v* l!t ( 


}) {\B Vt{i) \-H'(v)} 


s™ + fe 1 ic'k* 1:t (a)l 


} 


( 22 ) 


Algorithm 2 Edge rank updating 
1 : Input: B v and H'(v) 

2 : Output: Y* 

3. Sort B v , get {B v ^, B v ^ 2 ) 5 •••> B v ^ p _ } 

4: Initialize t = 0, sum( 0) = 0 and Y-j* 0 = 0 
5: t = t + 1 

6 : while t < p do 

7: sum(t ) = sum(t — 1) + \B V ( t )| — H[(y) 

8 : m = T(Yy t _ 1 ), C\ Y * 1:t (m + 1) = {|B„, (t+1) | - 

H 't +iWI and C\ Y * 1:t (0) = { 00 } 

9: Binary search to find out the largest index b 

among 1 , ...,m + 1 , such that y\ Y * i Jb~ !) > 


max {0 


1 


(3) 

,1 :t 

*1 cw 


*,.A S ) 


-} 


10: S=\dZl\C\ Y . t _ 1 (s)\ 


Set C\ Y * t (b) = 


r _ , , sum(i) — sum(S) 

{Rep(max{ -— ——, 0}, t - S) 

t o 


11: n*t = uSllC|y._ t _ 1 ( a ) + C|y.. t (6) 

12: end while 


4. Related Work & Hyper Parameters 

To introduce scale-free prior (p(d) ex d~ a ), (Liu & Ih- 
ler, 2011 ) proposed to approximate the degree of node i 
by ||_A_j ||! = Yhj+i ki.i and then use the following ob¬ 
jective function. 

p p 

F(X) + a^logdlX^II, + a) + pY, I*mI> (23) 

i=l j= 1 

where e, is the parameter for each node to smooth out the 
scale-free distribution. Without prior knowledge, it is not 
easy to determine the value of f , . Note that the above ob¬ 
jective function is not convex even when F(x) is convex 
because of the log-sum function involved. The objective is 
optimized by reweighing the penalty for each X, 7 at each 
step, and the method (denoted as RW) is guaranteed to con¬ 
verge to a local optimal. The parameter e is set as diagonal 
of estimated A' in previous iteration, and /? as 2 — , as sug¬ 
gested by the authors. They use a to control the sparsity, 
i.e. the number of predicted edges. 


and k is the largest value such that y |y* t {k - 1) > y k - 

Theorem 3 clearly shows that this problem satisfies the op¬ 
timal substructure property and thus, can be solved by dy¬ 
namic programming. A 0(p\og(p)) algorithm (Algorithm 
2) to solve (20) is proposed. See supplementary material 
for the proof of substructure property and the correctness 
of the 0(plog(p)) algorithm. In Algorithm 2, Rep(x 1 p) 
duplicates x by p times. 


Recently (Defazio & Caetano, 2012) proposed a Lovasz ex¬ 
tension approach to approximate node degree by a convex 
function. The convex function is a reweighed 1\ with larger 
penalty applied to edges with relatively larger strength. It 
turns out that such kind of convex function prior does not 
work well when we just need to predict a few edges, as 
shown in the experiments. Further, both (Liu & Ihler, 2011) 
and (Defazio & Caetano, 2012) consider only the global de¬ 
gree distribution instead of the degree of each node. 
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(Tan et al., 2014) proposes a method (denoted as Hub) 
specifically for a graph with a few hubs and applies a 
group lasso penalty. In particular, they decompose X as 
Z + V + V T , where Z is a sparse symmetric matrix and 
V is a matrix whose column are almost entirely zero or 
non-zero. Intuitively, Z describes the relationship between 
non-hubs and V that between hubs. They formulate the 
problem as follows. 

min F(X) + AiHZHi + A 2 ||V|| + A 3 W V oh 

i=i 

s.t. X = Z + V + V T (24) 

An ADMM algorithm is used to solve this problem. In 
our test, we use A 3 = 0.01 to yield the best performance. 
Besides,we set A = Ai = A 2 to produce a graph with a 
desired level of sparsity. 

Our method uses 2 hyper-parameters: 7 and /?. Meanwhile, 
7 is the hyper parameter for the power-law distribution and 
A controls sparsity. 

5. Results 

We tested our method on two real gene expression datasets 
and two types of simulated networks: ( 1 ) a scale-free net¬ 
work generated by Barabasi-Albert (BA) model (Albert & 
Barabasi, 2002) and (2) a network with a few hub nodes. 
We generated the data for the simulated scale-free net¬ 
work by its corresponding Multivariate Gaussian distribu¬ 
tion. We compared our method (denoted as ”DNS”, short 
for ’’Dynamic Node-Specific Prior”) with graphical lasso 
C’Glasso”) (Friedman et al., 2008), neighborhood selection 
(”NS”) (Meinshausen & Biihlmann, 2006), reweighted li 
regularization (”RW”) (Liu & Ihler, 2011), Lovasz exte- 
nion (’’Lovasz”) (Defazio & Caetano, 2012) and a recent 
hub detection method (’’Hub”) (Tan et al., 2014). 

5.1. Performance on Scale-Free Networks 

We generated a network of 500 nodes (p = 500) from the 
BA model. Each entry f 1 UV of the precision matrix is set 
to 0.3 if (u,v) forms an edge, and 0 otherwise. To make 
the precision matrix positive definite, we set the diagonal 
of fl to the minimum eigenvalue of O plus 0.2. In total we 
generate 250 data samples from the corresponding multi¬ 
variate Gaussian distribution ( i.e.,n = 250). The hyper¬ 
parameters of all the methods are set as described in the 
last section. 

Our method is not sensitive to the hyper-parameter 7. As 
shown in Figure 2, a few different 7 values (2.1, 2.5, and 
2.9) yield almost the same result. Hence we use 7 = 2.5 in 
the following experiments. 



Figure 2. Simulation results on a scale free network. Gaussian 
Graphical Model is used with n = 250 andp = 500. X-axis is the 
number of predicted edges and Y-axis is the number of correctly 
predicted edges. 


forms the others in terms of prediction accuracy. It is not 
surprising that both RW and Hub outperform Glasso and 
NS since the former two methods are specifically designed 
for scale-free or hub networks. Lovasz, which is also de¬ 
signed for scale-free networks, would outperform Glasso 
as the number of predicted edges increase. 

Figure 1 displays the log-log degree distribution of the 
true network and the networks estimated by DNS, RW and 
Glasso. Both DNS and RW yield networks satisfying the 
power-law distribution while Glasso does not, which con¬ 
firms that both DNS and RW indeed favor scale-free net¬ 
works. 


Figure 3. Visualization of the hub network with 1643 nodes and 
3996 edges. For visualization purpose we ignore a connected 
component with less than or equal to 4 nodes. We also highlight 
the hub nodes, whose degree is at least 30. 


Moreover, as shown in Figure 2, our method DNS outper- 
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Figure 1. From left to right, the log-log degree distribution of (1) the true network and the estimated networks by (2) DNS (3) RW and 
(4) Glasso, respectively. Linear relationship is expected since the true network is scale-free. The network yielded by Glasso violates the 
power-law distribution most, as evidenced by a point close to (2,0). 




Figure 4. Simulation results of a hub network. Gaussian Graph¬ 
ical Model is used with n = 806 and p = 1643. X-axis is the 
number of predicted edges and Y-axis is the number of correctly 
predicted edges. 


Figure 5. The performance of all the tested methods on a real gene 
expression data set (DREAM5 dataset 3). .Y-axis is the number of 
predicted edges and Y -axis is the number of correctly predicted 
edges. 


5.2. Performance on Hub Networks 

We also tested our method on a hub network, which con¬ 
tains a few nodes with very large degrees but not strictly 
follows the scale-free property. See Figure 3 for a visual¬ 
ization, where larger dots indicate the hub nodes. Here we 
use the DREAM5 Network Inference Challenge dataset 1, 
which is a simulated gene expression data with 806 sam¬ 
ples. DREAM5 also provides a ground truth network for 
this dataset. See (Marbach et ah, 2012) for more details. 

The result in Figure 4 shows that our method outperforms 
all the others, although our method is not specifically de¬ 
signed for hub networks. This shows that DNS also per¬ 
forms well in a graph with non-uniform degree distribution 
but without strict scale-free property. 


5.3. Gene Expression Data 

To further test our method, we used DREAM5 dataset 3 and 
4 respectively. Dataset3 contains 805 samples and 4511 
genes and its ground truth consists of 3902 edges. Dataset4 
contains 536 samples and 5950 genes and its ground truth 
consists of 3940 edges. The two datasets are very challeng¬ 
ing as the data is noisy and without Gaussian and scale-free 
property. For each dataset, up to 100,000 predicted edges 
are allowed for submission for each team competing in the 
contest. See (Marbach et al., 2012) for a detailed descrip¬ 
tion of the two data sets. We determine the hyper param¬ 
eters of all the methods such that they output exactly the 
same number of edges. As shown in Figure 5, our method 
obtains much higher accuracy than the others on DREAM5 
dataset 3. To compare the degree distribution of different 
methods, we chose the values of the hyper-parameters such 
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Figure 6. From left to right, the log-log degree distribution plot for (1) the true network and the estimated networks by (2) Glasso and 
(3) DNS, respectively. 


that each method yields 4000 edges. The degree distribu¬ 
tions of the resultant networks are shown in Figure 6. As 
shown in this figure, our estimated network is more similar 
to a scale-free network. 

We also tested our result on DREAM5 dataset 4. As most 
algorithms are time consuming, we just nin our method 
DNS and Glasso. According to our test, both our algorithm 
and Glasso perform not very good on this dataset (Actually, 
all DREAM5 competitors do not do well on this dataset 
(Marbach et ah, 2012)). But our algorithm still outperforms 
Glasso in terms of accuracy. Actually, the accuracy of DNS 
is about two times of Glassso (e.g. when predicting about 
19000 edges, Glasso generate 9 correct edges while DNS 
find 18 correct edges). 

6. Conclusions 

We have presented a novel node-specific degree prior 
to study the inference of scale-free networks, which are 
widely used to model social and biological networks. Our 
prior not only promotes a desirable global node degree dis¬ 
tribution, but also takes into consideration the estimated de¬ 
gree of each individual node and the relative strength of all 
the possible edges adjacent to the same node. To fulfill 
this, we have developed a ranking-based algorithm to dy¬ 
namically model the degree distribution of a given network. 
The optimization problem resulting from our prior is quite 
challenging. We have developed a novel ADMM algorithm 
to solve it. 

We have demonstrated the superior performance of our 
prior using simulated data and three DREAM5 datasets. 
Our prior greatly outperforms the others in terms of the 
number of correctly predicted edges, especially on the real 
gene expression data. 


The idea presented in this paper can potentially be use¬ 
ful to other degree-constrained network inference problem. 
In particular, it might be applied to infer protein residue- 
residue interaction network from a multiple sequence align¬ 
ment, for which we may predict the degree distribution of 
each residue using a supervised machine learning method. 
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